
[OpenVINO] Export DFlash for OpenVINO#1756

Open: ofirzaf wants to merge 12 commits into huggingface:main from ofirzaf:dflash-qwen3.5


@ofirzaf ofirzaf commented May 30, 2026

What does this PR do?

This PR adds support for exporting DFlash draft models for speculative decoding with OpenVINO.
It also adds hidden_states annotations to exported OV models, to better support methods that need hidden_states as outputs from OV models (such as DFlash and Eagle3). The annotations are applied automatically to all models exported for text generation. The graph does not change, since these are only annotations.

Command to export a DFlash model with this PR:

optimum-cli export openvino \
  --model z-lab/Qwen3.6-Coder-35B-A3B-DFlash \
  --task text-generation-with-past \
  --trust-remote-code \
  --dflash-target-model Qwen/Qwen3.6-35B-A3B \
  --disable-convert-tokenizer \
  qwen3.6-35b-a3b-dflash-int8-ov

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

ofirzaf added 12 commits May 4, 2026 02:51
- Introduced `--dflash-target-model` argument for exporting DFlash draft models.
- Implemented `update_config_for_dflash` to handle DFlash-specific configurations.
- Enhanced model conversion and metadata handling for DFlash models.
- Added `DFlashDummyInputGenerator` for generating dummy inputs specific to DFlash.
- Updated tests to include DFlash model loading and export functionality.

This update enables export and inference of models that use the DFlash architecture with OpenVINO.
- Removed the direct call to `_load_target_weights` in the constructor of `Qwen3DFlashForCausalLM`.
- Added a class method `from_pretrained` to handle loading weights and configurations more effectively.
- Updated weight handling to ensure compatibility with the target data type.
- Modified the `extract_dflash_debug_bundle.py` script to use `dtype` instead of `torch_dtype` and added `attn_implementation` parameter for draft model loading.

These changes improve the model's initialization process and enhance the flexibility of loading configurations.
…dels

- Introduced functions to check and annotate hidden states in models during export.
- Enhanced configuration to include hidden state outputs for models with multiple hidden layers.
- Implemented a test suite to validate hidden state annotations in exported OpenVINO models.

These changes improve the model export process by allowing the inclusion of hidden states, which is essential for certain text generation tasks.
- Implemented helper functions to find and add model outputs based on tensor names.
- Added a new test case to validate that annotated hidden state outputs match those from PyTorch for the GPT-2 model.
- Enhanced the export process to include hidden state outputs, ensuring compatibility with text generation tasks.

These changes improve the testing framework for OpenVINO model exports, specifically focusing on hidden state annotations.
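The helper for finding model outputs by tensor name could look roughly like the sketch below. It is not the code from this PR: the function name `find_outputs_by_name` and the substring match on `"hidden_states"` are assumptions made for illustration. It relies only on each output exposing `get_names()`, which OpenVINO `Output` objects do.

```python
from typing import Iterable, List, Protocol, Set


class NamedOutput(Protocol):
    """Anything exposing tensor names the way an OpenVINO output port does."""

    def get_names(self) -> Set[str]: ...


def find_outputs_by_name(outputs: Iterable[NamedOutput], needle: str) -> List[NamedOutput]:
    """Return the outputs that have at least one tensor name containing `needle`.

    Hypothetical helper: matching by substring is an assumption, not the
    matching rule used in the PR.
    """
    return [
        out
        for out in outputs
        if any(needle in name for name in out.get_names())
    ]
```

With OpenVINO installed, this could be applied to an exported model as `find_outputs_by_name(openvino.Core().read_model(xml_path).outputs, "hidden_states")`, where `xml_path` points at the exported IR file.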
- Added support for overriding the DFlash block size via the environment variable `DFLASH_BLOCK_SIZE_OVERRIDE`.
- Included error handling to ensure the block size is an integer greater than 1.
- This enhancement allows for more flexible configuration of DFlash model exports, improving usability and performance.

These changes contribute to the ongoing improvements in the OpenVINO export process for DFlash models.
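As a rough illustration of the override described above, the sketch below reads `DFLASH_BLOCK_SIZE_OVERRIDE` and rejects anything that is not an integer greater than 1. The function name `resolve_block_size`, the `environ` parameter, and the error wording are assumptions; the PR's actual implementation may differ.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "DFLASH_BLOCK_SIZE_OVERRIDE"


def resolve_block_size(config_block_size: int, environ: Optional[Mapping[str, str]] = None) -> int:
    """Return the block size from the environment override, else the config value.

    Hypothetical helper: `environ` defaults to os.environ and exists only so the
    function can be exercised without touching the real environment.
    """
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR)
    if raw is None:
        # No override set: keep the value from the model config.
        return config_block_size
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{ENV_VAR} must be an integer greater than 1, got {raw!r}") from exc
    if value <= 1:
        raise ValueError(f"{ENV_VAR} must be an integer greater than 1, got {value}")
    return value
```

Failing early on a bad value keeps a mistyped override from silently falling back to the configured block size during export.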
- Added support for committed prefix cache policy in DFlash models by updating runtime information.
- Modified `DFlashDummyInputGenerator` to use "hidden_states" instead of "target_hidden" for input names.
- Updated Qwen3DFlash model to handle hidden states and past key values more effectively during inference.
- Introduced a new script to compare DFlash cache semantics between original and patched models.
- Enhanced tests to validate the integration of hidden states and ensure consistency in outputs.

These changes improve the functionality and testing of DFlash models within the OpenVINO framework, ensuring better performance and reliability.